Name: Syeduzzaman Khan
Project Title: A recipe for success as a movie producer
Movies are one of the major entertainment sources of our time, and the history of film is more than a century old. From the beginning until today, movies have attracted viewers all over the world, and at present almost every country has a movie industry. The project dataset contains movie industry data from 1986 to 2016.
The purpose of this project is to analyze over 30 years of movie industry data and explain the findings to Steven Spielberg, so that he can base his film investment decisions on the analyzed data.
To achieve Steven Spielberg's objective, we have to analyze the dataset carefully from different perspectives. At the beginning of the project, we will get familiar with the dataset and look into its columns. Then, we will segment the data logically based on interesting features that lead us toward our goal. The obtained results will be visualized using static and dynamic plots.
Read Dataset:
# read csv file
dataset <- read.csv ("movies.csv", na.strings="",stringsAsFactors=FALSE)
2.1 Print top 5 rows
head (dataset,n=5) # print top 5 rows
2.2 Number of rows and Number of columns:
cat ("Number of Rows: ",nrow(dataset)) # calculate number of rows
cat ("\nNumber of Columns: ",ncol(dataset)) # calculate number of columns
The dataset is credible. It has a moderate number of rows and columns.
2.3 Top Five Budget Films:
Budget <- as.numeric(dataset$budget) # store budget as numbers
Budget[is.na(Budget)] <- 0 # missing values set to zero
head(sort(Budget,decreasing=TRUE),5) # five largest budgets
The budgets of the top films are very high. Typically, science fiction and war films need larger production budgets.
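Sorting the budget column on its own loses the link to the film it belongs to. As a minimal sketch, assuming a small made-up data frame (the names `movies`, `ranked`, and `top2` and all values are hypothetical), `order()` can rank whole rows and push missing budgets to the end instead of turning them into zeros:

```r
# Toy data; the values are invented for illustration only.
movies <- data.frame(
  name   = c("A", "B", "C", "D"),
  budget = c(50e6, NA, 120e6, 8e6),
  stringsAsFactors = FALSE
)

# Rank rows by budget, largest first; rows with a missing budget go last.
ranked <- movies[order(movies$budget, decreasing = TRUE, na.last = TRUE), ]

# Keep the two largest budgets together with the film names.
top2 <- head(ranked, 2)
print(top2)
```

Keeping the whole row makes it possible to read off the name, genre, or company of each top film without a second lookup.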
2.4 Long Run time [hrs]
Runtime <- as.numeric(dataset$runtime) # store runtime (minutes) as numbers
Runtime[is.na(Runtime)] <- 0 # missing values set to zero
head(sort(Runtime/60,decreasing=TRUE),5) # five longest runtimes in hours
The longest runtime is 6.1 hours, which is unusually long. It is hard to imagine many viewers sitting through a six-hour film.
3.1 Most profitable movie genre -> gross - budget & film type
Gross <- as.numeric(dataset$gross) # store gross income as numbers
Gross[is.na(Gross)] <- 0 # missing values set to zero
net <- Gross - Budget # profit of every film
top_idx <- head(order(net,decreasing=TRUE),5) # rows of the five most profitable films
profit <- net[top_idx]
film_genere <- dataset$genre[top_idx] # genre of each of those films
Top profitable film Genre:
head(film_genere,5)
The genres listed above belong to the most profitable films and are candidates for future investment.
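The genres of the five most profitable films are only one view of profitability. A complementary view is the total profit per genre. The following is a minimal sketch on made-up numbers (the data frame `films` and its values are hypothetical, not taken from movies.csv):

```r
# Toy data; gross and budget are invented values in millions.
films <- data.frame(
  genre  = c("Action", "Comedy", "Action", "Drama", "Comedy"),
  gross  = c(300, 80, 150, 60, 40),
  budget = c(200, 20, 100, 30, 10),
  stringsAsFactors = FALSE
)

# Profit of every film.
films$profit <- films$gross - films$budget

# Sum the profit within each genre, then rank genres by total profit.
profit_by_genre <- vapply(split(films$profit, films$genre), sum, numeric(1))
profit_by_genre <- sort(profit_by_genre, decreasing = TRUE)
print(profit_by_genre)
```

Replacing `sum` with `mean` or `median` would rank genres by typical profit per film rather than by total profit, which can give a different ordering when one genre has many more films than the others.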
3.2 Highest User Rating and company
score <- as.numeric(dataset$score) # store user score as numbers
score[is.na(score)] <- 0 # missing values set to zero
top_idx <- head(order(score,decreasing=TRUE),5) # rows of the five highest-scored films
score_array <- score[top_idx]
company_h_rating <- dataset$company[top_idx] # production company of each of those films
Highest user rating Film Companies:
head(company_h_rating,5)
The film companies above produced the highest-rated films of the period 1986 to 2016.
3.3 Top voted Actor
# top votes
votes <- as.numeric(dataset$votes) # store votes as numbers
votes[is.na(votes)] <- 0 # missing values set to zero
top_idx <- head(order(votes,decreasing=TRUE),5) # rows of the five most-voted films
votes_array <- votes[top_idx]
a <- dataset$star[top_idx] # lead actor of each of those films
head(a,5)
The actors above star in the five films that received the most votes.
3.4 Top voted Writer
recent <- dataset[dataset$year>2010,] # films released after 2010
votes <- as.numeric(recent$votes) # votes after 2010
votes[is.na(votes)] <- 0 # missing values set to zero
top_idx <- head(order(votes,decreasing=TRUE),5) # rows of the five most-voted recent films
votes_array <- votes[top_idx]
a <- recent$writer[top_idx] # writer of each of those films
Top rated Writer:
head(a,5)
The output above shows the most popular screenplay writers after 2010.
3.5 Top voted Director
recent <- dataset[dataset$year>2010,] # films released after 2010
votes <- as.numeric(recent$votes) # votes after 2010
votes[is.na(votes)] <- 0 # missing values set to zero
top_idx <- head(order(votes,decreasing=TRUE),5) # rows of the five most-voted recent films
votes_array <- votes[top_idx]
a <- recent$director[top_idx] # director of each of those films
Top popular Director:
head(a,5)
The output above shows the most popular directors after 2010.
The dataset has been segmented logically based on the following features:
Film Genre
Production Company
Popular Actor
Popular Writer
Director
To make a movie, the elements listed above are the main ones to consider. We need to select a film genre, production company, actor, writer, and director. With the right combination of these characteristics, it is possible to make a popular movie.
4.1 Pie chart -> gross income by movie genre
gross<-dataset$gross # gross income of every film
gross_action<-dataset[dataset$genre=="Action",c("gross")] # gross income of Action films
gross_Adventure<-dataset[dataset$genre=="Adventure",c("gross")] # gross income of Adventure films
gross_Comedy<-dataset[dataset$genre=="Comedy",c("gross")] # gross income of Comedy films
gross_Drama<-dataset[dataset$genre=="Drama",c("gross")] # gross income of Drama films
others<-sum(gross)-(sum(gross_action)+sum(gross_Adventure)+sum(gross_Comedy)+sum(gross_Drama)) # all remaining genres
x <- c(sum(gross_action),sum(gross_Adventure),sum(gross_Comedy),sum(gross_Drama),others)
labels <- c("Action","Adventure","Comedy","Drama","Others")
piepercent<- round(100*x/sum(x), 1) # share of total gross income in percent
pie(x, labels = piepercent, main = "Gross Income",col = rainbow(length(x)))
legend("topright", labels, cex = 0.8,
fill = rainbow(length(x)))
x1<-x
The pie chart shows each genre's share of gross income in percent. The Action genre holds the number one position with about 33% of gross income. Comedy, drama, and adventure hold the 2nd, 3rd, and 4th positions. The remaining genres together account for about 23.6% of gross income.
4.2 Scatter chart-> Score vs Runtime
# scatter plot
plot(x = dataset$score,y = dataset$runtime,
xlab = "Score",
ylab = "Runtime",
xlim = c(2,10),
ylim = c(50,400),
main = "Score vs Runtime"
)
The scatter plot shows runtime against score. Films with the average runtime of about 120 min receive an average score of about 7.5.
4.3 Bar Chart: Avg Budget vs Decade
decade1<- mean(dataset[dataset$year<=1995,c("budget")]) # 1986-1995
decade2<- mean(dataset[dataset$year>1995 & dataset$year<=2005,c("budget")]) # 1996-2005 only
decade3<- mean(dataset[dataset$year>2005,c("budget")]) # 2006-2016 only
v<-c(decade1,decade2,decade3)
v<-v/1000000 # convert to millions
M <- c("1986-1995","1996-2005","2006-2016")
barplot(v,names.arg=M,ylim = c(0,1.1*max(v)),xlab="Decade",ylab="Budget [Million]",
main="Budget")
The bar chart shows the average budget per decade. The average budget has increased over time.
4.4 Box Plot : Actor vs Gross
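Per-decade averages can also be computed by binning the release year first, which guarantees that every film is counted in exactly one decade. This is a minimal sketch on made-up values (the data frame `films`, the column `decade`, and the numbers are hypothetical):

```r
# Toy data; budgets are invented values in millions.
films <- data.frame(
  year   = c(1986, 1995, 1996, 2005, 2006, 2016),
  budget = c(10, 20, 30, 50, 60, 100)
)

# Assign every film to one decade; intervals are (1985,1995], (1995,2005], (2005,2016].
films$decade <- cut(films$year,
                    breaks = c(1985, 1995, 2005, 2016),
                    labels = c("1986-1995", "1996-2005", "2006-2016"))

# Average budget within each decade.
avg_budget <- tapply(films$budget, films$decade, mean)
print(avg_budget)
```

The resulting named vector can be passed directly to `barplot()`, which uses the decade labels as bar names.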
votes <- as.numeric(dataset$votes) # store votes as numbers
votes[is.na(votes)] <- 0 # missing values set to zero
top_idx <- head(order(votes,decreasing=TRUE),5) # rows of the five most-voted films
votes_array <- votes[top_idx]
# actor
a <- dataset$star[top_idx] # lead actors of the most-voted films
# gross income of all films starring one of the top-voted actors
dataset1<- dataset[dataset$star=="Christian Bale" ,c("star","gross")]
boxplot( (dataset1$gross)/1000000, xlab = "Christian Bale",ylab = "Film Gross Income [Million USD]", main = "Gross Income of top Star films",names=c("Christian Bale"),ylim=c(0,220))
The box-and-whisker plot above shows the gross income of Christian Bale's films. The highest gross income among his films is about 210 million USD.
4.5 qplot : Votes vs Score
library(ggplot2) # qplot is part of ggplot2
votes<-dataset$votes
score<-dataset$score
qplot(score,votes,data=dataset,geom=c("point","line"),colour=I("red")) # I() sets a fixed colour instead of mapping it
The qplot above shows votes against score. Films with higher scores tend to receive more votes.
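A claim that two columns rise together can be checked numerically with a correlation coefficient. The sketch below uses invented score and vote values (not taken from movies.csv) and compares the Pearson coefficient, which measures linear association, with the Spearman coefficient, which only measures whether the ordering agrees:

```r
# Toy data; values are invented for illustration only.
score <- c(5.0, 6.0, 6.5, 7.0, 8.0, 9.0)
votes <- c(1000, 5000, 4000, 20000, 90000, 400000)

# Pearson: strength of a straight-line relationship.
pearson <- cor(score, votes, method = "pearson")

# Spearman: strength of a monotonic (rank-based) relationship.
spearman <- cor(score, votes, method = "spearman")

cat("Pearson: ", round(pearson, 3), "\n")
cat("Spearman:", round(spearman, 3), "\n")
```

A high Spearman value together with a lower Pearson value suggests that votes increase with score but not along a straight line, which is weaker than strict proportionality. With real data, `use = "complete.obs"` can be passed to `cor()` to skip rows with missing values.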
4.6 Histogram of Movie Runtime
hist(dataset$runtime,
main="Histogram for runtime",
xlab="Runtime",
border="blue",
col="green",
xlim=c(0,500),
ylim=c(0,4000),
las=1,
breaks=5)
The histogram shows the distribution of movie runtime. Most films have a runtime between 100 and 150 min.
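The bin counts behind a histogram can be read off without drawing it, which makes a statement about the most common runtime range easy to verify. A minimal sketch with made-up runtimes (the vector `runtime` and the 50-minute bins are assumptions for illustration):

```r
# Toy runtimes in minutes; invented for illustration only.
runtime <- c(85, 95, 105, 110, 120, 125, 140, 150, 165, 190)

# Compute the histogram without plotting; bins are (50,100], (100,150], (150,200].
h <- hist(runtime, breaks = seq(50, 200, by = 50), plot = FALSE)

# Number of films per bin and the bin with the most films.
print(h$counts)
busiest <- which.max(h$counts)
cat("Most films fall between", h$breaks[busiest], "and", h$breaks[busiest + 1], "min\n")
```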
4.7 Plot function: Votes vs Gross Income
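# Plot
# separate names are used so that v (average budget per decade) stays available for section 5.2
votes_all<-dataset$votes
gross_all<-dataset$gross
plot(votes_all,gross_all/1000000,col="red", lwd=5, xlab="votes", ylab="gross [Million]", main="Votes vs Gross income")
The plot above shows votes against gross income. Gross income tends to increase with the number of votes.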
4.8 Density function of score
plot(density(dataset$score))
The density plot of film scores is shown in the graph above. The density function reaches its maximum of about 0.4 at a score of 6.8.
4.9 3D Scatter plot
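The location and height of the density peak can be extracted from the object returned by `density()` rather than read off the graph. A minimal sketch on invented scores (the vector `scores` is hypothetical):

```r
# Toy scores; invented for illustration only.
scores <- c(5.5, 6.0, 6.5, 6.8, 6.9, 7.0, 7.1, 7.5, 8.0)

# Kernel density estimate; d$x holds the grid points and d$y the estimated density.
d <- density(scores)

# Grid point with the highest density and the density value there.
peak_score   <- d$x[which.max(d$y)]
peak_density <- max(d$y)

cat("Peak at score", round(peak_score, 2), "with density", round(peak_density, 2), "\n")
```

The exact peak depends on the bandwidth chosen by `density()`, so the reported value should be treated as approximate.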
if (!require("scatterplot3d")) install.packages("scatterplot3d")
library(scatterplot3d)
scatterplot3d(dataset$budget,dataset$score,dataset$votes, color=as.integer(dataset$votes)) # axes: budget, score, votes
The 3D scatter plot shows budget, score, and votes for each film. Most of the low-budget films have the lowest scores and votes.
library(plotly)
5.1 Dynamic 3D plot of score, votes, and budget
data<-dataset[c("score","votes","budget")]
# formulas (~column) refer to columns of data; marker size scales with votes
plot_ly(data, x = ~score, y = ~votes, z = ~budget,
size = ~votes,
type="scatter3d", mode="markers")
The above 3D dynamic plot is for score, votes, and budget.
5.2 Dynamic Bar chart
plot_ly(x = M,y = v,name = "Budget over Decades",type = "bar")
The dynamic bar chart is drawn using the plotly library. It shows the same data as the static bar chart: the average budget of a movie has increased over the decades.
5.3 Dynamic Pie/donut Chart: Gross income by film genre
x2<-c("Action","Adventure","Comedy","Drama","Others")
plot_ly(labels = x2, values =piepercent ) %>%
add_pie(hole = 0.6) %>%
layout(title = "Gross Income by Genre", showlegend = F,
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
The donut chart shows gross income by movie genre. The Action genre holds the number one position with about 33% of gross income. Comedy, drama, and adventure hold the 2nd, 3rd, and 4th positions. The remaining genres together account for about 23.6% of gross income.
5.4 Bubble chart-> score vs runtime
data3<-dataset[c("score","runtime")]
plot_ly(data3, x = ~score, y = ~runtime, type = 'scatter', mode = 'markers',
marker = list( opacity = 0.5)) %>%
layout(title = 'Score vs Runtime',
xaxis = list(showgrid = FALSE),
yaxis = list(showgrid = FALSE))
The bubble chart above shows runtime against score. Score tends to increase with runtime.
5.5 Dynamic Horizontal Box plot
# plot the actual columns; rnorm() would draw random numbers unrelated to the data
plot_ly(x = ~dataset$gross, type = "box") %>%
add_trace(x = ~dataset$score)
The above plot represents the gross income (trace 0) and score (trace 1).
The dataset provides good insight into movie history. The dataset is reliable. Therefore, the chance of finding a near-optimal solution for small projects is moderate. The dataset has been segmented by genre, production company, actor, writer, and director. With the right combination of these characteristics, it is possible to make a popular movie.
Action is the highest-grossing genre. The top production houses are Castle Rock Entertainment, Warner Bros, and New Line Cinema. Popular actors based on people's votes are Christian Bale, Leonardo DiCaprio, Brad Pitt, and John Travolta.
The screenplay plays a vital role in a movie's success. Top writers are Jonathan Nolan, Quentin Tarantino, Joss Whedon, and Terence Winter. Without good direction, a film will never be watchable. Therefore, selecting the right director is important. The top listed film directors are Christopher Nolan, Quentin Tarantino, Joss Whedon, and Martin Scorsese.
To produce a blockbuster film, Steven Spielberg can consider the following suggestions:
1. Genre: Action
2. Production company: Warner Bros
3. Actor: Christian Bale
4. Director: Christopher Nolan
5. Writer: Jonathan Nolan
6. Runtime: 110 to 140 min